Abstract: We consider the regression model with observation error in the design:

$$ y = X\theta^* + \xi, $$

$$ Z = X + \Xi. $$

Here the random vector $y \in \mathbb{R}^n$ and the random $n \times p$ matrix $Z$ are observed, the $n \times p$ matrix $X$ is unknown, $\Xi$ is an $n \times p$ random noise matrix, $\xi \in \mathbb{R}^n$ is a random noise vector, and $\theta^*$ is a vector of unknown parameters to be estimated. We consider the setting where the dimension $p$ can be much larger than the sample size $n$ and $\theta^*$ is sparse. Because of the presence of the noise matrix $\Xi$, the commonly used Lasso and Dantzig selector are unstable. An alternative procedure called the Matrix Uncertainty (MU) selector has been proposed in Rosenbaum and Tsybakov (2010) in order to account for the noise. The properties of the MU selector have been studied in Rosenbaum and Tsybakov (2010) for sparse $\theta^*$ under the assumption that the noise matrix $\Xi$ is deterministic and its values are small. In this paper, we propose a modification of the MU selector when $\Xi$ is a random matrix with zero-mean entries whose variances can be estimated. This is, for example, the case in the model where the entries of $X$ are missing at random. We show both theoretically and numerically that, under these conditions, the new estimator, called the Compensated MU selector, achieves better accuracy of estimation than the original MU selector.

1. Introduction

We consider the model

$$ y = X\theta^* + \xi, \qquad (1) $$

$$ Z = X + \Xi, \qquad (2) $$

where the random vector $y \in \mathbb{R}^n$ and the random $n \times p$ matrix $Z$ are observed, the $n \times p$ matrix $X$ is unknown, $\Xi$ is an $n \times p$ random noise matrix, $\xi \in \mathbb{R}^n$ is a random noise vector, $\theta^* = (\theta_1^*, \dots, \theta_p^*) \in \Theta$ is a vector of unknown parameters to be estimated, and $\Theta$ is a given subset of $\mathbb{R}^p$. We consider the problem of estimating an $s$-sparse vector $\theta^*$ (i.e., a vector $\theta^*$ having only $s$ non-zero components), with $p$ possibly much larger than $n$. If the matrix $X$ in (1)-(2) is observed without error ($\Xi = 0$), this problem has been recently studied in numerous papers. The proposed estimators mainly rely on $\ell_1$ minimization techniques. In particular, this is the case for the widely used Lasso and Dantzig selector; see, among others, Candès and Tao (2007), Bunea et al. (2007a,b), Bickel et al. (2009), Koltchinskii (2009), the book by Bühlmann and van de Geer (2011), the lecture notes by Koltchinskii (2011), Belloni and Chernozhukov (2011) and the references cited therein.

However, it is shown in Rosenbaum and Tsybakov (2010) that dealing with a noisy observation of the regression matrix $X$ has severe consequences. In particular, the Lasso and Dantzig selector become very unstable in this context. An alternative procedure, called the matrix uncertainty selector (MU selector for short), is proposed in Rosenbaum and Tsybakov (2010) in order to account for the presence of the noise $\Xi$. The MU selector $\hat\theta^{MU}$ is defined as a solution of the minimization problem

$$ \min\{|\theta|_1 : \theta \in \Theta,\ |\tfrac{1}{n} Z^T (y - Z\theta)|_\infty \le \mu|\theta|_1 + \tau\}, \qquad (3) $$

where $|\cdot|_p$ denotes the $\ell_p$-norm, $1 \le p \le \infty$, $\Theta$ is a given subset of $\mathbb{R}^p$ characterizing the prior knowledge about $\theta^*$, and the constants $\mu$ and $\tau$ depend on the level of the noises $\Xi$ and $\xi$ respectively. If the noise terms $\xi$ and $\Xi$ are deterministic, it is suggested in Rosenbaum and Tsybakov (2010) to choose $\tau$ such that

$$ |\tfrac{1}{n} Z^T \xi|_\infty \le \tau, $$

and to take $\mu = \delta(1 + \delta)$ with $\delta$ such that

$$ |\Xi|_\infty \le \delta, $$

where, for a matrix $A$, we denote by $|A|_\infty$ its componentwise $\ell_\infty$-norm.

In this paper, we propose a modification of the MU selector for the model where $\Xi$ is a random matrix with independent and zero-mean entries $\Xi_{ij}$ such that the sums of expectations

$$ \sigma_j^2 = \frac{1}{n} \sum_{i=1}^n E(\Xi_{ij}^2), \quad j = 1, \dots, p, $$

are finite and admit data-driven estimators. Our main example where such estimators exist is the model with data missing at random (see below). The idea underlying the new estimator is the following. In the ideal setting where there is no noise $\Xi$, the estimation strategy for $\theta^*$ is based on the matrix $X$. When there is noise, this is impossible since $X$ is not observed, and so we have no other choice than using $Z$ instead of $X$. However, it is not hard to see that, under the above assumptions on $\Xi$, the matrix $Z^T Z/n$ appearing in (3) contains a bias induced by the diagonal entries of the matrix $\Xi^T \Xi/n$, whose expectations $\sigma_j^2$ do not vanish. If $\sigma_j^2$ can be estimated from the data, it is natural to make
a bias correction. This leads to a new estimator $\hat\theta$ defined as a solution of the minimization problem

$$ \min\{|\theta|_1 : \theta \in \Theta,\ |\tfrac{1}{n} Z^T (y - Z\theta) + \hat D \theta|_\infty \le \mu|\theta|_1 + \tau\}, \qquad (4) $$

where $\hat D$ is the diagonal matrix with entries $\hat\sigma_j^2$, which are estimators of $\sigma_j^2$, and $\mu \ge 0$ and $\tau \ge 0$ are constants that will be specified later. This estimator $\hat\theta$ will be called the Compensated MU selector. In this paper, we show both theoretically and numerically that the estimator $\hat\theta$ achieves better performance than the original MU selector $\hat\theta^{MU}$. In particular, under natural conditions given below, the bounds on the error of the Compensated MU selector decrease as $O(n^{-1/2})$ up to logarithmic factors as $n \to \infty$, whereas for the original MU selector $\hat\theta^{MU}$ the corresponding bounds do not decrease with $n$ and can be small only if the noise $\Xi$ is small.



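Remark 1. The problem (4) is equivalent to

$$ \min_{(\theta, u) \in W(\mu, \tau)} |\theta|_1, \qquad (5) $$

where

$$ W(\mu, \tau) = \{(\theta, u) \in \Theta \times \mathbb{R}^p : |\tfrac{1}{n} Z^T (y - Z\theta) + \hat D \theta + u|_\infty \le \tau,\ |u|_\infty \le \mu|\theta|_1\}, \qquad (6) $$

with the same $\mu$ and $\tau$ as in (4) (see the proof in Section 7). In some cases, this simplifies the computation of the solution.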

An important example where the values $\sigma_j^2$ can be estimated is given by the model with missing data. Assume that the elements $X_{ij}$ of the matrix $X$ are unobservable, and we can only observe

$$ \tilde Z_{ij} = X_{ij} \eta_{ij}, \quad i = 1, \dots, n,\ j = 1, \dots, p, \qquad (7) $$

where for each fixed $j = 1, \dots, p$, the factors $\eta_{ij}$, $i = 1, \dots, n$, are i.i.d. Bernoulli random variables taking value 1 with probability $1 - \pi_j$ and 0 with probability $\pi_j$, $0 < \pi_j < 1$. The data $X_{ij}$ is missing if $\eta_{ij} = 0$, which happens with probability $\pi_j$. We can rewrite (7) in the form

$$ Z_{ij} = X_{ij} + \Xi_{ij}, \qquad (8) $$

where $Z_{ij} = \tilde Z_{ij}/(1 - \pi_j)$, $\Xi_{ij} = X_{ij}(\eta_{ij} - (1 - \pi_j))/(1 - \pi_j)$. Thus, we can reduce the model with missing data (7) to the form (2) with a matrix $\Xi$ whose elements $\Xi_{ij}$ have zero mean and variance $X_{ij}^2 \pi_j/(1 - \pi_j)$. So,

$$ \sigma_j^2 = \frac{1}{n} \sum_{i=1}^n X_{ij}^2 \frac{\pi_j}{1 - \pi_j}. \qquad (9) $$

In Section 4 below, we show that when the $\pi_j$ are known, the $\sigma_j^2$ admit good data-driven estimators $\hat\sigma_j^2$. If the $\pi_j$ are unknown, they can be readily estimated by the empirical frequencies of 0 that we further denote by $\hat\pi_j$. Then the $Z_{ij} = \tilde Z_{ij}/(1 - \pi_j)$ appearing in (8) are not available and should be replaced by $\hat Z_{ij} = \tilde Z_{ij}/(1 - \hat\pi_j)$. This slightly changes the model and implies a minor modification of the estimator (cf. Section 4).
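The reduction of the missing-data model to the form (2) can be checked by direct enumeration. The sketch below is only an illustration: the function names `noise_moments` and `sigma2` are hypothetical, and exact rational arithmetic is used so that the identities $E\,\Xi_{ij} = 0$ and $\mathrm{Var}\,\Xi_{ij} = X_{ij}^2 \pi_j/(1 - \pi_j)$ hold without rounding.

```python
from fractions import Fraction

def noise_moments(x, pi):
    """Exact mean and variance of Xi = x*(eta - (1-pi))/(1-pi),
    where eta = 1 with probability 1-pi and eta = 0 with probability pi."""
    outcomes = [(Fraction(1), 1 - pi), (Fraction(0), pi)]  # (eta, probability)
    values = [(x * (eta - (1 - pi)) / (1 - pi), prob) for eta, prob in outcomes]
    mean = sum(v * prob for v, prob in values)
    var = sum((v - mean) ** 2 * prob for v, prob in values)
    return mean, var

def sigma2(column, pi):
    """sigma_j^2 as in (9): average of x^2 * pi / (1 - pi) over one column of X."""
    n = len(column)
    return sum(x * x * pi / (1 - pi) for x in column) / n

# Illustrative values only (not taken from the paper).
x, pi = Fraction(3, 2), Fraction(1, 10)
mean, var = noise_moments(x, pi)
```

With these illustrative inputs the mean is exactly zero and the variance equals $x^2 \pi/(1 - \pi)$, which is the per-entry contribution to $\sigma_j^2$.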

2. Definitions and notation 

Consider the following random matrices

$$ M^{(1)} = \tfrac{1}{n} X^T \Xi, \quad M^{(2)} = \tfrac{1}{n} X^T \xi, \quad M^{(3)} = \tfrac{1}{n} \Xi^T \xi, $$

$$ M^{(4)} = \tfrac{1}{n} (\Xi^T \Xi - \mathrm{Diag}\{\Xi^T \Xi\}), \quad M^{(5)} = \tfrac{1}{n} \mathrm{Diag}\{\Xi^T \Xi\} - D, $$

where $D$ is the diagonal matrix with diagonal elements $\sigma_j^2$, $j = 1, \dots, p$, and for a square matrix $A$, we denote by $\mathrm{Diag}\{A\}$ the matrix with the same dimensions as $A$, the same diagonal elements as $A$ and all off-diagonal elements equal to zero.

Under conditions that will be specified below, the entries of the matrices $M^{(k)}$ are small with probability close to 1. Bounds on the $\ell_\infty$-norms of the matrices $M^{(k)}$ characterize the stochastic error of the estimation. The accuracy of the estimators is determined by these bounds and by the properties of the Gram matrix

$$ \Psi \triangleq \tfrac{1}{n} X^T X. $$

For a vector $\theta$, we denote by $\theta_J$ the vector in $\mathbb{R}^p$ that has the same coordinates as $\theta$ on the set of indices $J \subset \{1, \dots, p\}$ and zero coordinates on its complement $J^c$. We denote by $|J|$ the cardinality of $J$.

To state our results in a general form, we follow Gautier and Tsybakov (2011) and introduce the sensitivity characteristics related to the action of the matrix $\Psi$ on the cone

$$ C_J \triangleq \{\Delta \in \mathbb{R}^p : |\Delta_{J^c}|_1 \le |\Delta_J|_1\}, $$

where $J$ is a subset of $\{1, \dots, p\}$. For $q \in [1, \infty]$ and an integer $s \in [1, p]$, we define the $\ell_q$ sensitivity as follows:

$$ \kappa_q(s) = \min_{J : |J| \le s} \Big( \min_{\Delta \in C_J : |\Delta|_q = 1} |\Psi \Delta|_\infty \Big). $$

We will also consider the coordinate-wise sensitivities

$$ \kappa_k^*(s) = \min_{J : |J| \le s} \Big( \min_{\Delta \in C_J : \Delta_k = 1} |\Psi \Delta|_\infty \Big), $$

where $\Delta_k$ is the $k$th coordinate of $\Delta$, $k = 1, \dots, p$. To get meaningful bounds for various types of estimation errors, we will need the positivity of $\kappa_q(s)$ or $\kappa_k^*(s)$. As shown in Gautier and Tsybakov (2011), this requirement is weaker than the usual assumptions related to the structure of the Gram matrix $\Psi$, such as the Restricted Eigenvalue assumption and the Coherence assumption. For completeness, we recall these two assumptions.

Assumption RE(s). Let $1 \le s \le p$. There exists a constant $\kappa_{RE}(s) > 0$ such that

$$ \min_{\Delta \in C_J \setminus \{0\}} \frac{|\Delta^T \Psi \Delta|}{|\Delta_J|_2^2} \ge \kappa_{RE}(s) $$

for all subsets $J$ of $\{1, \dots, p\}$ of cardinality $|J| \le s$.

Assumption C. All the diagonal elements of $\Psi$ are equal to 1 and all its off-diagonal elements $\Psi_{ij}$ satisfy the coherence condition: $\max_{i \ne j} |\Psi_{ij}| \le \rho$ for some $\rho < 1$.

Note that Assumption C with $\rho < (3s)^{-1}$ implies Assumption RE(s) with $\kappa_{RE}(s) = \sqrt{1 - 3\rho s}$, see Bickel et al. (2009) or Lemma 2 in Lounici (2008). From Proposition 4.2 of Gautier and Tsybakov (2011) we get that, under Assumption C with $\rho < (2s)^{-1}$,

$$ \kappa_\infty(s) \ge 1 - 2\rho s, \qquad (10) $$

which yields the control of the sensitivities $\kappa_q(s)$ for all $1 \le q \le \infty$ since

$$ \kappa_q(s) \ge (2s)^{-1/q} \kappa_\infty(s), \quad \forall\, 1 \le q \le \infty, \qquad (11) $$

by Proposition 4.1 of Gautier and Tsybakov (2011). Furthermore, Proposition 9.2 of Gautier and Tsybakov (2011) implies that, under Assumption RE(s),

$$ \kappa_1(s) \ge (4s)^{-1} \kappa_{RE}(s), \qquad (12) $$

and by Proposition 9.3 of that paper, under Assumption RE(2s) for any $s \le p/2$ and any $1 < q \le 2$, we have

$$ \kappa_q(s) \ge C(q)\, s^{-1/q} \kappa_{RE}(2s), \qquad (13) $$

where $C(q) = 2^{-1/q - 1/2} (1 + (q - 1)^{-1/q})^{-1}$.
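To make the lower bounds (10), (11) and the constant in (13) concrete, here is a small sketch. It assumes that (11) reads $\kappa_q(s) \ge (2s)^{-1/q} \kappa_\infty(s)$ and that $C(q) = 2^{-1/q-1/2}(1 + (q-1)^{-1/q})^{-1}$; the helper names `coherence`, `kappa_inf_lower`, `kappa_q_lower` and `c_const` are hypothetical, and the Gram matrix is a toy example.

```python
import math

def coherence(gram):
    """Largest absolute off-diagonal entry of a square matrix (list of rows)."""
    p = len(gram)
    return max(abs(gram[i][j]) for i in range(p) for j in range(p) if i != j)

def kappa_inf_lower(rho, s):
    """Lower bound (10): kappa_inf(s) >= 1 - 2*rho*s, valid when rho < 1/(2s)."""
    if not rho < 1.0 / (2 * s):
        raise ValueError("bound (10) requires rho < 1/(2s)")
    return 1.0 - 2.0 * rho * s

def kappa_q_lower(rho, s, q):
    """Combination of (10) and (11): kappa_q(s) >= (2s)^(-1/q) * (1 - 2*rho*s)."""
    inv_q = 0.0 if math.isinf(q) else 1.0 / q
    return (2 * s) ** (-inv_q) * kappa_inf_lower(rho, s)

def c_const(q):
    """Constant C(q) of (13), defined for 1 < q <= 2."""
    if not 1 < q <= 2:
        raise ValueError("C(q) is defined for 1 < q <= 2")
    return 2.0 ** (-1.0 / q - 0.5) / (1.0 + (q - 1.0) ** (-1.0 / q))

# Toy Gram matrix with unit diagonal (illustrative values only).
gram = [[1.0, 0.1, -0.05],
        [0.1, 1.0, 0.02],
        [-0.05, 0.02, 1.0]]
rho = coherence(gram)
```

For $q = 2$ the constant evaluates to $C(2) = 1/4$, which is the value used in the proof of Theorem 2.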

3. Main results 

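In this section, we give bounds on the estimation and prediction errors of the Compensated MU selector. For $\varepsilon \ge 0$, we consider the thresholds $b(\varepsilon) \ge 0$ and $\delta_i(\varepsilon) \ge 0$, $i = 1, \dots, 5$, such that

$$ P\Big( \max_{j = 1, \dots, p} |\hat\sigma_j^2 - \sigma_j^2| \ge b(\varepsilon) \Big) \le \varepsilon, \qquad (14) $$

and

$$ P\big( |M^{(i)}|_\infty \ge \delta_i(\varepsilon) \big) \le \varepsilon, \quad i = 1, \dots, 5. \qquad (15) $$

Define

$$ \mu(\varepsilon) = \delta_1(\varepsilon) + \delta_4(\varepsilon) + \delta_5(\varepsilon) + b(\varepsilon), \quad \tau(\varepsilon) = \delta_2(\varepsilon) + \delta_3(\varepsilon), $$

and $A(\varepsilon) = A(\mu(\varepsilon), \tau(\varepsilon))$, where

$$ A(\mu, \tau) = \{\theta \in \Theta : |\tfrac{1}{n} Z^T (y - Z\theta) + \hat D \theta|_\infty \le \mu|\theta|_1 + \tau\}, \quad \forall\, \mu, \tau \ge 0, \qquad (16) $$

and $\Theta$ is a given subset of $\mathbb{R}^p$. For $\varepsilon \ge 0$, the Compensated MU selector is defined as a solution of the minimization problem

$$ \min\{|\theta|_1 : \theta \in A(\varepsilon)\}. \qquad (17) $$

We have the following result.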

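Theorem 1. Assume that model (1)-(2) is valid with an $s$-sparse vector of parameters $\theta^* \in \Theta$, where $\Theta$ is a given subset of $\mathbb{R}^p$. For $\varepsilon \ge 0$, set

$$ \nu(\varepsilon) = 2(\mu(\varepsilon) + \delta_1(\varepsilon)) |\theta^*|_1 + 2\tau(\varepsilon). $$

Then, with probability at least $1 - 6\varepsilon$, the set $A(\varepsilon)$ is not empty and for any solution $\hat\theta$ of (17) we have

$$ |\hat\theta - \theta^*|_q \le \frac{\nu(\varepsilon)}{\kappa_q(s)}, \quad \forall\, 1 \le q \le \infty, \qquad (18) $$

$$ |\hat\theta_k - \theta_k^*| \le \frac{\nu(\varepsilon)}{\kappa_k^*(s)}, \quad \forall\, 1 \le k \le p, \qquad (19) $$

$$ \frac{1}{n} |X(\hat\theta - \theta^*)|_2^2 \le \min\Big\{ \frac{\nu^2(\varepsilon)}{\kappa_1(s)},\ 2\nu(\varepsilon)|\theta^*|_1 \Big\}. \qquad (20) $$

The proof of this theorem is given in Section 7.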

Note that (20) contains a bound on the prediction error under no assumption on $X$:

$$ \frac{1}{n} |X(\hat\theta - \theta^*)|_2^2 \le 2\nu(\varepsilon)|\theta^*|_1. $$

The other bounds in Theorem 1 depend on the sensitivities. Using (10)-(13) we obtain the following corollary of Theorem 1.

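Theorem 2. Let the assumptions of Theorem 1 be satisfied. Then, with probability at least $1 - 6\varepsilon$, for any solution $\hat\theta$ of (17) we have the following inequalities.

(i) Under Assumption RE(s):

$$ |\hat\theta - \theta^*|_1 \le \frac{4s\,\nu(\varepsilon)}{\kappa_{RE}(s)}, \qquad (21) $$

$$ \frac{1}{n} |X(\hat\theta - \theta^*)|_2^2 \le \frac{4s\,\nu^2(\varepsilon)}{\kappa_{RE}(s)}. \qquad (22) $$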



December 20, 2011 



M.Rosenbaum and A. B.Tsybakov/ Matrix Uncertainty Selector 






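(ii) Under Assumption RE(2s), $s \le p/2$:

$$ |\hat\theta - \theta^*|_q \le \frac{4 s^{1/q}\, \nu(\varepsilon)}{\kappa_{RE}(2s)}, \quad \forall\, 1 < q \le 2. \qquad (23) $$

(iii) Under Assumption C with $\rho < \frac{1}{2s}$:

$$ |\hat\theta - \theta^*|_q \le \frac{(2s)^{1/q}\, \nu(\varepsilon)}{1 - 2\rho s}, \quad \forall\, 1 \le q \le \infty, \qquad (24) $$

where we set $1/\infty = 0$.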
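As a numerical illustration of how the quantities in Theorems 1 and 2 combine, the sketch below evaluates $\nu(\varepsilon) = 2(\mu(\varepsilon) + \delta_1(\varepsilon))|\theta^*|_1 + 2\tau(\varepsilon)$ and the right-hand side of (24), taken here to be $(2s)^{1/q}\nu(\varepsilon)/(1 - 2\rho s)$. The function names and the input values are hypothetical and chosen only for the example.

```python
import math

def nu(mu, delta1, tau, theta_l1):
    """nu(eps) = 2*(mu + delta1)*|theta*|_1 + 2*tau, as in Theorem 1."""
    return 2.0 * (mu + delta1) * theta_l1 + 2.0 * tau

def bound_coherence(nu_val, rho, s, q):
    """Right-hand side of (24): (2s)^(1/q) * nu / (1 - 2*rho*s), with 1/inf = 0."""
    if not rho < 1.0 / (2 * s):
        raise ValueError("(24) requires rho < 1/(2s)")
    inv_q = 0.0 if math.isinf(q) else 1.0 / q
    return (2 * s) ** inv_q * nu_val / (1.0 - 2.0 * rho * s)

# Illustrative inputs: mu = 0.1, delta_1 = 0.02, tau = 0.05, |theta*|_1 = 1.5.
nu_val = nu(0.1, 0.02, 0.05, 1.5)
sup_norm_bound = bound_coherence(nu_val, rho=0.05, s=2, q=math.inf)
l1_bound = bound_coherence(nu_val, rho=0.05, s=2, q=1)
```

The $\ell_1$ bound is larger than the $\ell_\infty$ bound by the factor $2s$, which reflects the $(2s)^{1/q}$ dependence in (24).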



If the components of $\xi$ and $\Xi$ are subgaussian, the values $\delta_i(\varepsilon)$ are of order $O(n^{-1/2})$ up to logarithmic factors, and the value $b(\varepsilon)$ is of the same order in the model with missing data (see Section 4). Then, the bounds for the Compensated MU selector in Theorem 2 are decreasing with rate $n^{-1/2}$ as $n \to \infty$. This is an advantage of the Compensated MU selector as compared to the original MU selector $\hat\theta^{MU}$, for which the corresponding bounds do not decrease with $n$ and can be small only if the noise $\Xi$ is small (cf. Rosenbaum and Tsybakov (2010)).

If the matrix $X$ is observed without error ($\Xi = 0$), then $\mu(\varepsilon) = 0$, $\delta_i(\varepsilon) = 0$, $i \ne 2$, and the Compensated MU selector coincides with the Dantzig selector. In this particular case, the results (ii) and (iii) of Theorem 2 improve, in terms of the constants or the range of validity, upon the corresponding bounds in Bickel et al. (2009) and Lounici (2008).



4. Control of the stochastic error terms 

Theorems 1 and 2 are stated with general thresholds $\delta_i(\varepsilon)$ and $b(\varepsilon)$, and can be used both for random or deterministic noises $\xi$, $\Xi$ (in the latter case, $\varepsilon = 0$) and random or deterministic $X$. In this section, considering $\varepsilon > 0$, we first derive the values $\delta_i(\varepsilon)$ for random $\xi$ and $\Xi$ with subgaussian entries, and then we specify $b(\varepsilon)$ and the matrix $\hat D$ for the model with missing data. Note that, for random $\xi$ and $\Xi$, the values $\delta_i(\varepsilon)$ and $b(\varepsilon)$ characterize the stochastic error of the estimator.

4.1. Thresholds $\delta_i(\varepsilon)$ under subgaussian noise

Recall that a zero-mean random variable $W$ is said to be $\gamma$-subgaussian ($\gamma > 0$) if, for all $t \in \mathbb{R}$,

$$ E[\exp(tW)] \le \exp(\gamma^2 t^2/2). \qquad (25) $$

In particular, if $W$ is a zero-mean gaussian or bounded random variable, it is subgaussian. A zero-mean random variable $W$ will be called $(\gamma, t_0)$-subexponential if there exist $\gamma > 0$ and $t_0 > 0$ such that

$$ E[\exp(tW)] \le \exp(\gamma^2 t^2/2), \quad \forall\, |t| \le t_0. \qquad (26) $$
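A simple bounded example makes the subgaussian condition (25) tangible. For a Rademacher variable $W$ (values $+1$ and $-1$ with probability $1/2$ each), $E[\exp(tW)] = \cosh(t)$, and the standard inequality $\cosh(t) \le \exp(t^2/2)$ shows that $W$ is 1-subgaussian. The sketch below checks this on a grid; the choice of the Rademacher variable and of the grid is an illustration, not part of the paper.

```python
import math

def rademacher_mgf(t):
    """E[exp(t*W)] for W taking values +1 and -1 with probability 1/2 each."""
    return math.cosh(t)

def subgaussian_bound(t, gamma):
    """Right-hand side of (25): exp(gamma^2 * t^2 / 2)."""
    return math.exp(gamma ** 2 * t ** 2 / 2.0)

# Check (25) with gamma = 1 on a grid of t values in [-5, 5].
grid = [k / 10.0 for k in range(-50, 51)]
holds = all(rademacher_mgf(t) <= subgaussian_bound(t, 1.0) for t in grid)

# With a too-small constant (gamma = 0.5) the inequality fails somewhere on the grid.
fails_for_small_gamma = not all(
    rademacher_mgf(t) <= subgaussian_bound(t, 0.5) for t in grid
)
```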
Let the noise terms $\xi$ and $\Xi$ satisfy the following assumption.

Assumption N. Let $\gamma_\Xi > 0$, $\gamma_\xi > 0$. The entries $\Xi_{ij}$, $i = 1, \dots, n$, $j = 1, \dots, p$, of the matrix $\Xi$ are zero-mean $\gamma_\Xi$-subgaussian random variables, the $n$ rows of $\Xi$ are independent, and $E(\Xi_{ij} \Xi_{ik}) = 0$ for $j \ne k$, $i = 1, \dots, n$. The components $\xi_i$ of the vector $\xi$ are independent zero-mean $\gamma_\xi$-subgaussian random variables satisfying $E(\Xi_{ij} \xi_i) = 0$, $i = 1, \dots, n$, $j = 1, \dots, p$.

Assumption N implies that the random variables $\Xi_{ij} \xi_i$, $\Xi_{ij} \Xi_{ik}$ are subexponential. Indeed, if two random variables $\zeta$ and $\eta$ are subgaussian, then for some $c > 0$ we have $E \exp(c \zeta \eta) < \infty$, which implies that (26) holds for $W = \zeta \eta$ with some $\gamma$, $t_0$ whenever $E(\zeta \eta) = 0$, cf., e.g., Petrov (1995), page 56.

Next, $\zeta_j = (1/n) \sum_{i=1}^n \Xi_{ij}^2 - \sigma_j^2$ is a zero-mean subexponential random variable with variance $O(1/n)$. It is easy to check that (26) holds for $W = \zeta_j$ with $\gamma = O(1/\sqrt{n})$ and $t_0 = O(n)$.

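To simplify the notation, we will use a rougher evaluation valid under Assumption N, namely that all $\Xi_{ij} \xi_i$, $\Xi_{ij} \Xi_{ik}$ are $(\gamma_0, t_0)$-subexponential with the same $\gamma_0 > 0$ and $t_0 > 0$, and all $\zeta_j$ are $(\gamma_0/\sqrt{n}, t_0 n)$-subexponential. Here the constants $\gamma_0$ and $t_0$ depend only on $\gamma_\Xi$ and $\gamma_\xi$. For $0 < \varepsilon < 1$ and an integer $N$, set

$$ \bar\delta(\varepsilon, N) = \max\Big( \gamma_0 \sqrt{\frac{2 \log(N/\varepsilon)}{n}},\ \frac{2 \log(N/\varepsilon)}{t_0 n} \Big). $$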

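Lemma 1. Let Assumption N be satisfied, and let $X$ be a deterministic matrix with $\max_{1 \le j \le p} \frac{1}{n} \sum_{i=1}^n X_{ij}^2 = m_2$. Then for any $0 < \varepsilon < 1$ the bound (15) holds with

$$ \delta_1(\varepsilon) = \gamma_\Xi \sqrt{\frac{2 m_2 \log(2p^2/\varepsilon)}{n}}, \quad \delta_2(\varepsilon) = \gamma_\xi \sqrt{\frac{2 m_2 \log(2p/\varepsilon)}{n}}, \qquad (27) $$

$$ \delta_3(\varepsilon) = \delta_5(\varepsilon) = \bar\delta(\varepsilon, 2p), \quad \delta_4(\varepsilon) = \bar\delta(\varepsilon, p(p - 1)). \qquad (28) $$

Proof. Use the union bound and the facts that $P(W > \delta) \le \exp(-\delta^2/(2\gamma^2))$ for a $\gamma$-subgaussian $W$, and $P(\frac{1}{n} \sum_{i=1}^n W_i > \delta) \le \max\big( \exp(-n\delta^2/(2\gamma^2)),\ \exp(-\delta t_0 n/2) \big)$ for a sum of independent $(\gamma, t_0)$-subexponential $W_i$. □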
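The thresholds of Lemma 1 are explicit, so their dependence on $n$ can be inspected numerically. The sketch below assumes the forms $\delta_1 = \gamma_\Xi \sqrt{2 m_2 \log(2p^2/\varepsilon)/n}$, $\delta_2 = \gamma_\xi \sqrt{2 m_2 \log(2p/\varepsilon)/n}$, $\delta_3 = \delta_5 = \bar\delta(\varepsilon, 2p)$, $\delta_4 = \bar\delta(\varepsilon, p(p-1))$ with $\bar\delta(\varepsilon, N) = \max(\gamma_0 \sqrt{2\log(N/\varepsilon)/n},\ 2\log(N/\varepsilon)/(t_0 n))$. The function names and all numerical constants ($\gamma_\Xi$, $\gamma_\xi$, $\gamma_0$, $t_0$, $m_2$) are hypothetical placeholders.

```python
import math

def delta_bar(eps, N, n, gamma0, t0):
    """Threshold for an average of n independent (gamma0, t0)-subexponential terms,
    after a union bound over N events."""
    log_term = math.log(N / eps)
    return max(gamma0 * math.sqrt(2.0 * log_term / n), 2.0 * log_term / (t0 * n))

def thresholds(eps, n, p, m2, gamma_Xi, gamma_xi, gamma0, t0):
    """delta_1, ..., delta_5 under the forms assumed in the lead-in."""
    d1 = gamma_Xi * math.sqrt(2.0 * m2 * math.log(2.0 * p * p / eps) / n)
    d2 = gamma_xi * math.sqrt(2.0 * m2 * math.log(2.0 * p / eps) / n)
    d3 = d5 = delta_bar(eps, 2 * p, n, gamma0, t0)
    d4 = delta_bar(eps, p * (p - 1), n, gamma0, t0)
    return {1: d1, 2: d2, 3: d3, 4: d4, 5: d5}

# Placeholder constants; quadrupling n halves the subgaussian thresholds.
small_n = thresholds(0.05, 100, 500, 1.0, 0.3, 0.05, 1.0, 1.0)
large_n = thresholds(0.05, 400, 500, 1.0, 0.3, 0.05, 1.0, 1.0)
```

Comparing `small_n` and `large_n` shows the $n^{-1/2}$ scaling of $\delta_1$ and $\delta_2$ that drives the rate discussed after Theorem 2.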

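4.2. Data-driven $\hat D$ and $b(\varepsilon)$ for the model with missing data

Consider now the model with missing data (7) and assume that $X$ is non-random. Then we have $\tilde Z_{ij}^2 = X_{ij}^2 \eta_{ij}$, which implies:

$$ E[\tilde Z_{ij}^2] = X_{ij}^2 (1 - \pi_j), \quad j = 1, \dots, p. $$

Hence, $\tilde Z_{ij}^2 \pi_j/(1 - \pi_j)^2$ is an unbiased estimator of $X_{ij}^2 \pi_j/(1 - \pi_j)$. Then $\sigma_j^2$ defined in (9) is naturally estimated by

$$ \hat\sigma_j^2 = \frac{1}{n} \sum_{i=1}^n \tilde Z_{ij}^2 \frac{\pi_j}{(1 - \pi_j)^2}. \qquad (29) $$

The matrix $\hat D$ is then defined as a diagonal matrix with diagonal entries $\hat\sigma_j^2$. It is not hard to prove that $\hat\sigma_j^2$ approximates $\sigma_j^2$ in probability with rate $O(n^{-1/2})$ up to a logarithmic factor. For example, let the probability that the data is missing be the same for all $j$: $\pi_1 = \dots = \pi_p = \pi_*$. Then

$$ P\Big( \Big| \frac{1}{n} \sum_{i=1}^n \tilde Z_{ij}^2 \frac{\pi_*}{(1 - \pi_*)^2} - \frac{1}{n} \sum_{i=1}^n X_{ij}^2 \frac{\pi_*}{1 - \pi_*} \Big| \ge b \Big) \le 2 \exp\Big( - \frac{2 n b^2 (1 - \pi_*)^4}{\pi_*^2 m_4} \Big), $$

where we have used the fact that $0 \le Z_{ij}^2 \le X_{ij}^2 (1 - \pi_*)^{-2}$, Hoeffding's inequality and the notation $m_4 = \max_{1 \le j \le p} \frac{1}{n} \sum_{i=1}^n X_{ij}^4$. This proves (14) with

$$ b(\varepsilon) = \frac{\pi_*}{(1 - \pi_*)^2} \sqrt{\frac{m_4 \log(2p/\varepsilon)}{2n}}. $$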

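If $\pi_*$ is unknown, we replace it by the estimator $\hat\pi = \frac{1}{np} \sum_{i,j} 1_{\{\tilde Z_{ij} = 0\}}$, where $1_{\{\cdot\}}$ denotes the indicator function. Another difference is that the $Z_{ij} = \tilde Z_{ij}/(1 - \pi_j)$ appearing in (8) are not available when the $\pi_j$ are unknown. Therefore, we slightly modify the estimator using $\tilde Z_{ij}$ instead of $Z_{ij}$; we define $\hat\theta$ as a solution of $\min\{|\theta|_1 : \theta \in \hat A(\varepsilon)\}$ with

$$ \hat A(\varepsilon) = \{\theta \in \Theta : |\tfrac{1}{n} \tilde Z^T (y(1 - \hat\pi) - \tilde Z \theta) + \hat D \theta|_\infty \le \hat\mu(\varepsilon)|\theta|_1 + \hat\tau(\varepsilon)\}, \qquad (30) $$

where $\hat\mu(\varepsilon)$ and $\hat\tau(\varepsilon)$ are suitably chosen constants, $\tilde Z$ is the $n \times p$ matrix with entries $\tilde Z_{ij}$, and $\hat D$ is a diagonal matrix with entries $\hat\sigma_j^2 = \frac{1}{n} \sum_{i=1}^n \tilde Z_{ij}^2 \hat\pi/(1 - \hat\pi)^2$. This modification introduces in the bounds an additional term proportional to $\hat\pi - \pi_*$, which is of the order $O((np)^{-1/2})$ in probability and hence is negligible as compared to the error bound for the Compensated MU selector.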
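The unbiasedness argument of this section can be verified exhaustively on a tiny column. The sketch below assumes the known-$\pi$ estimator $\hat\sigma_j^2 = \frac{1}{n}\sum_i \tilde Z_{ij}^2 \pi_j/(1 - \pi_j)^2$ and the threshold $b(\varepsilon) = \frac{\pi_*}{(1-\pi_*)^2}\sqrt{m_4 \log(2p/\varepsilon)/(2n)}$ obtained from Hoeffding's inequality and a union bound over the $p$ columns. Function names and input values are hypothetical.

```python
from fractions import Fraction
from itertools import product
import math

def sigma2_hat(z_tilde_col, pi):
    """Known-pi estimator: (1/n) * sum_i z_tilde_i^2 * pi / (1 - pi)^2."""
    n = len(z_tilde_col)
    return sum(z * z for z in z_tilde_col) * pi / ((1 - pi) ** 2 * n)

def expected_sigma2_hat(x_col, pi):
    """Exact expectation of sigma2_hat over all 2^n observation patterns."""
    total = Fraction(0)
    for eta in product((0, 1), repeat=len(x_col)):
        prob = Fraction(1)
        for e in eta:
            prob *= (1 - pi) if e == 1 else pi  # eta = 1: observed, eta = 0: missing
        z_tilde = [x * e for x, e in zip(x_col, eta)]
        total += prob * sigma2_hat(z_tilde, pi)
    return total

def b_threshold(eps, n, p, m4, pi_star):
    """b(eps) under the form assumed in the lead-in (common missing probability)."""
    return pi_star / (1.0 - pi_star) ** 2 * math.sqrt(
        m4 * math.log(2.0 * p / eps) / (2.0 * n))

# Illustrative column of X and missing probability.
x_col = [Fraction(1), Fraction(-2), Fraction(1, 2)]
pi = Fraction(1, 5)
target = sum(x * x for x in x_col) * pi / ((1 - pi) * len(x_col))  # sigma_j^2 from (9)
```

The enumeration confirms that the expectation of the estimator equals $\sigma_j^2$ exactly for this column, and `b_threshold` shows the $n^{-1/2}$ decay of the estimation error of $\hat\sigma_j^2$.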

Remark 2. In this section, we have considered non-random $X$. Using the same argument, it is easy to derive analogous expressions for $\delta_i(\varepsilon)$ and $b(\varepsilon)$ when $X$ is a random matrix with independent subgaussian entries, and $\xi$, $\Xi$ are independent of $X$.



5. Confidence intervals 

The bounds of Theorems 1 and 2 depend on the unknown matrix $X$ via the sensitivities, and therefore cannot be used to provide confidence intervals. In this section, we show how to address the issue of confidence intervals by deriving another type of bounds based on the empirical sensitivities. Note first that the matrix $\hat\Psi = \frac{1}{n} Z^T Z - \hat D$ is a natural estimator of the unknown Gram matrix $\Psi$. It is $\sqrt{n}$-consistent in $\ell_\infty$-norm under the conditions of the previous section. Therefore, it makes sense to define the empirical counterparts of $\kappa_q(s)$ and $\kappa_k^*(s)$ by the relations:

$$ \hat\kappa_q(s) = \min_{J : |J| \le s} \Big( \min_{\Delta \in C_J : |\Delta|_q = 1} |\hat\Psi \Delta|_\infty \Big) $$

and

$$ \hat\kappa_k^*(s) = \min_{J : |J| \le s} \Big( \min_{\Delta \in C_J : \Delta_k = 1} |\hat\Psi \Delta|_\infty \Big). $$

The values $\hat\kappa_q(s)$ and $\hat\kappa_k^*(s)$, which we will call the empirical sensitivities, can be efficiently computed for small $s$ or, alternatively, one can compute data-driven lower bounds on them for any $s$ using linear programming, cf. Gautier and Tsybakov (2011).

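The following theorem establishes confidence intervals for an $s$-sparse vector $\theta^*$ based on the empirical sensitivities.

Theorem 3. Assume that model (1)-(2) is valid with an $s$-sparse vector of parameters $\theta^* \in \Theta$, where $\Theta$ is a given subset of $\mathbb{R}^p$. Then, with probability at least $1 - 6\varepsilon$, for any solution $\hat\theta$ of (17) we have

$$ |\hat\theta - \theta^*|_q \le \frac{2(\mu(\varepsilon)|\hat\theta|_1 + \tau(\varepsilon))}{\hat\kappa_q(s)\,(1 - \mu(\varepsilon)/\hat\kappa_1(s))_+}, \quad \forall\, 1 \le q \le \infty, \qquad (31) $$

$$ |\hat\theta_k - \theta_k^*| \le \frac{2(\mu(\varepsilon)|\hat\theta|_1 + \tau(\varepsilon))}{\hat\kappa_k^*(s)\,(1 - \mu(\varepsilon)/\hat\kappa_1(s))_+}, \quad \forall\, 1 \le k \le p, \qquad (32) $$

where $x_+ = \max(0, x)$, and we set $1/0 = \infty$.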

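Proof. Set $\Delta = \theta^* - \hat\theta$, and write for brevity $S(\theta) = \frac{1}{n} Z^T (y - Z\theta) + \hat D \theta$. Using Lemma 2 in Section 7, the fact that $|\Delta_{J^c}|_1 \le |\Delta_J|_1$ where $J$ is the set of non-zero components of $\theta^*$ (cf. Lemma 1 in Rosenbaum and Tsybakov (2010)) and the definition of the empirical sensitivity $\hat\kappa_1(s)$, we find

$$ |\hat\Psi \Delta|_\infty \le |S(\theta^*)|_\infty + |S(\hat\theta)|_\infty $$
$$ \le \mu(\varepsilon)(|\theta^*|_1 + |\hat\theta|_1) + 2\tau(\varepsilon) $$
$$ \le 2(\mu(\varepsilon)|\hat\theta|_1 + \tau(\varepsilon)) + \mu(\varepsilon)|\Delta|_1 $$
$$ \le 2(\mu(\varepsilon)|\hat\theta|_1 + \tau(\varepsilon)) + \frac{\mu(\varepsilon)}{\hat\kappa_1(s)} |\hat\Psi \Delta|_\infty. $$

This and the definition of $\hat\kappa_q(s)$ yield (31). The proof of (32) is analogous, with $\hat\kappa_k^*(s)$ used instead of $\hat\kappa_q(s)$. □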

Remark 3. Note that the bounds (31)-(32) remain valid for $s' \ge s$. Therefore, if one gets an estimator $\hat s$ of $s$ such that $\hat s \ge s$ with high probability, it can be plugged into the bounds in order to get completely feasible confidence intervals.
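Once the empirical sensitivities are available, the confidence bound is a direct computation. The sketch below assumes the bound has the form $2(\mu(\varepsilon)|\hat\theta|_1 + \tau(\varepsilon)) / \big(\hat\kappa\,(1 - \mu(\varepsilon)/\hat\kappa_1(s))_+\big)$, which is what the chain of inequalities in the proof of Theorem 3 gives, with the convention $1/0 = \infty$. The function name `confidence_radius` and the numerical inputs are hypothetical; computing the empirical sensitivities themselves is not shown.

```python
import math

def confidence_radius(mu, tau, theta_hat_l1, kappa_hat, kappa_hat_1):
    """2*(mu*|theta_hat|_1 + tau) / (kappa_hat * (1 - mu/kappa_hat_1)_+),
    returning infinity when the denominator vanishes (convention 1/0 = inf)."""
    if kappa_hat_1 <= 0:
        return math.inf
    shrink = max(0.0, 1.0 - mu / kappa_hat_1)  # the positive part (x)_+
    denom = kappa_hat * shrink
    if denom <= 0:
        return math.inf
    return 2.0 * (mu * theta_hat_l1 + tau) / denom

# Illustrative inputs: the bound is finite only when mu < kappa_hat_1.
finite_radius = confidence_radius(0.1, 0.05, 2.0, 0.5, 0.4)
infinite_radius = confidence_radius(0.5, 0.05, 2.0, 0.5, 0.4)
```

The second call illustrates that the bound becomes uninformative as soon as $\mu(\varepsilon) \ge \hat\kappa_1(s)$.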
6. Simulations

We consider here the model with missing data (7). Simulations in Rosenbaum and Tsybakov (2010) indicate that in this model the MU selector achieves better numerical performance than the Lasso or the Dantzig selector. Here we compare the MU selector with the Compensated MU selector. We design the numerical experiment in the following way.

- We take a matrix $X$ of size $100 \times 500$ ($n = 100$, $p = 500$) which is the normalized version (centered and then normalized so that all the diagonal elements of the associated Gram matrix $X^T X/n$ are equal to 1) of a $100 \times 500$ matrix with i.i.d. standard Gaussian entries.

- For a given integer $s$, we randomly (uniformly) choose $s$ non-zero elements in a vector $\theta^*$ of size 500. The associated coefficients $\theta_j^*$ are set to 0.5, and all other coefficients are set to 0. We take $s = 1, 2, 3, 5, 10$.

- We set $y = X\theta^* + \xi$, where $\xi$ is a vector with i.i.d. normal components with zero mean and variance $v^2$, $v = 0.05/1.96$.

- We compute the values $Z_{ij} = \tilde Z_{ij}/(1 - \pi_*)$ with $\tilde Z_{ij}$ as in (7), and $\pi_j = 0.1 = \pi_*$ for all $j$. (The value $\pi_*$ rather than its empirical counterpart, which is very close to $\pi_*$, is used in the algorithm to simplify the computations.)

- We run a linear programming algorithm to compute the solutions of (3) and (17), where we optimize over $\Theta = \mathbb{R}^{500}$. To simplify the comparison with Rosenbaum and Tsybakov (2010), we write $\mu$ in the form $(1 + \delta)\delta$ with $\delta = 0, 0.01, 0.05, 0.075, 0.1$. In particular, $\delta = 0$ corresponds to the Dantzig selector based on the noisy matrix $Z$. In practice, one can use the empirical procedure for choosing $\delta$ described in Rosenbaum and Tsybakov (2010). The choice of $\tau$ is not crucial and influences the output of the algorithm only slightly. The results presented below correspond to $\tau$ chosen in the same way as in the numerical study in Rosenbaum and Tsybakov (2010).

- We compute the error measures

$$ Err_1 = |\hat\theta - \theta^*|_2^2 \quad \text{and} \quad Err_2 = |X(\hat\theta - \theta^*)|_2^2. $$

We also record the retrieved sparsity pattern, which is defined as the set of the non-zero coefficients of $\hat\theta$.

- For each value of $s$ we run 100 Monte Carlo simulations.
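The data-generating steps listed above can be sketched as follows. This is a minimal illustration using only the Python standard library: it covers the design, the sparse vector, the response, the masked matrix and the two error measures (taking $Err_1$ as the squared $\ell_2$ error), and it does not solve the linear programs (3) or (17). The function names, the seed handling and the default arguments are assumptions made for the example.

```python
import math
import random

def make_design(n, p, rng):
    """Gaussian matrix, each column centered and scaled so that (X^T X / n)_jj = 1."""
    cols = []
    for _ in range(p):
        col = [rng.gauss(0.0, 1.0) for _ in range(n)]
        mean = sum(col) / n
        col = [c - mean for c in col]
        scale = math.sqrt(sum(c * c for c in col) / n)
        cols.append([c / scale for c in col])
    return [[cols[j][i] for j in range(p)] for i in range(n)]  # rows of X

def simulate(n=100, p=500, s=3, pi_star=0.1, noise_sd=0.05 / 1.96, seed=0):
    """One Monte Carlo draw: returns X, theta_star, y and the rescaled noisy design Z."""
    rng = random.Random(seed)
    X = make_design(n, p, rng)
    support = rng.sample(range(p), s)  # uniformly chosen non-zero positions
    theta = [0.0] * p
    for j in support:
        theta[j] = 0.5
    y = [sum(X[i][j] * theta[j] for j in support) + rng.gauss(0.0, noise_sd)
         for i in range(n)]
    # Z_ij = X_ij * eta_ij / (1 - pi_star); eta_ij = 0 (missing) with probability pi_star.
    Z = [[(X[i][j] if rng.random() >= pi_star else 0.0) / (1.0 - pi_star)
          for j in range(p)] for i in range(n)]
    return X, theta, y, Z

def errors(X, theta_hat, theta_star):
    """Err_1 = |theta_hat - theta*|_2^2 and Err_2 = |X (theta_hat - theta*)|_2^2."""
    diff = [a - b for a, b in zip(theta_hat, theta_star)]
    err1 = sum(d * d for d in diff)
    err2 = sum(sum(row[j] * diff[j] for j in range(len(diff))) ** 2 for row in X)
    return err1, err2
```

A call such as `simulate(n=20, p=30, s=3, seed=1)` produces a small instance on which an LP solver of the reader's choice can be run; repeating the draw with different seeds gives the Monte Carlo replications.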

Tables 1-5 present the empirical averages and standard deviations (in brackets) of $Err_1$, $Err_2$, of the number of non-zero coefficients in $\hat\theta$ ($Nb_1$) and of the number of non-zero coefficients in $\hat\theta$ belonging to the true sparsity pattern ($Nb_2$). We also present the total number of simulations where the sparsity pattern is exactly retrieved (Exact). The lines with "$\delta = v$" for $v = 0, 0.01, 0.05, 0.075, 0.1$ correspond to the MU selector and those with "C-$\delta = v$" to the Compensated MU selector.

1 Remark that this experiment slightly differs from those in Rosenbaum and Tsybakov (2010), where the matrix taken in (3) has entries $\tilde Z_{ij}$.

                      Err_1             Err_2             Nb_1             Nb_2     Exact
$\delta$ = 0          0.0196 (0.0114)   1.334 (0.5865)    70.13 (10.91)    1 (0)    [missing]
C-$\delta$ = 0        0.0225 (0.0145)   1.495 (0.6993)    80.09 (8.343)    1 (0)    [missing]
$\delta$ = 0.01       0.0131 (0.0069)   0.9318 (0.3606)   45.45 (9.507)    1 (0)    1
C-$\delta$ = 0.01     0.0095 (0.0062)   0.8386 (0.4625)   46.88 (9.737)    1 (0)    [missing]
$\delta$ = 0.05       0.0100 (0.0038)   0.8001 (0.2121)   12.45 (5.798)    1 (0)    3
C-$\delta$ = 0.05     0.0042 (0.0027)   0.3412 (0.1844)   10.52 (5.764)    1 (0)    [illegible]
$\delta$ = 0.075      0.0100 (0.0030)   0.8878 (0.1869)   6.28 (4.261)     1 (0)    14
C-$\delta$ = 0.075    0.0038 (0.0020)   0.3377 (0.1348)   4.91 (3.674)     1 (0)    [illegible]
$\delta$ = 0.1        0.0110 (0.0024)   1.038 (0.1582)    3.22 (2.640)     1 (0)    [illegible]
C-$\delta$ = 0.1      0.0044 (0.0015)   0.4255 (0.1040)   2.37 (2.042)     1 (0)    [illegible]

Tab. 1. Results for the model with missing data, s = 1.




                      Err_1             Err_2             Nb_1             Nb_2     Exact
$\delta$ = 0          0.0437 (0.0170)   2.756 (1.060)     80.04 (5.149)    2 (0)    [missing]
C-$\delta$ = 0        0.0685 (0.0275)   2.951 (1.129)     92.67 (3.911)    2 (0)    [missing]
$\delta$ = 0.01       0.0287 (0.0107)   1.838 (0.5423)    49.29 (6.717)    2 (0)    [missing]
C-$\delta$ = 0.01     0.0201 (0.0098)   1.561 (0.6827)    48.18 (6.775)    2 (0)    [missing]
$\delta$ = 0.05       0.0264 (0.0093)   2.105 (0.4960)    10.35 (4.631)    2 (0)    1
C-$\delta$ = 0.05     0.0125 (0.0066)   0.9796 (0.3849)   7.70 (4.092)     2 (0)    8
$\delta$ = 0.075      0.0301 (0.0090)   2.694 (0.5022)    4.77 (2.587)     2 (0)    24
C-$\delta$ = 0.075    0.0148 (0.0052)   1.359 (0.3573)    3.41 (1.924)     2 (0)    47
$\delta$ = 0.1        0.0371 (0.0086)   3.521 (0.4730)    2.62 (1.046)     2 (0)    65
C-$\delta$ = 0.1      0.0218 (0.0059)   2.088 (0.3853)    2.28 (0.617)     2 (0)    77

Tab. 2. Results for the model with missing data, s = 2.
                      Err_1             Err_2             Nb_1             Nb_2     Exact
$\delta$ = 0          0.0772 (0.0296)   4.361 (1.268)     83.95 (4.177)    3 (0)    [missing]
C-$\delta$ = 0        0.1480 (0.0436)   4.258 (1.253)     97.76 (3.262)    3 (0)    [missing]
$\delta$ = 0.01       0.0493 (0.0176)   2.929 (0.7907)    49.78 (6.515)    3 (0)    [missing]
C-$\delta$ = 0.01     0.0351 (0.0153)   2.328 (0.8442)    48.23 (6.302)    3 (0)    [missing]
$\delta$ = 0.05       0.0528 (0.0166)   4.295 (0.7696)    9.82 (3.907)     3 (0)    1
C-$\delta$ = 0.05     0.0281 (0.0109)   2.343 (0.6360)    7.02 (3.608)     3 (0)    [illegible]
$\delta$ = 0.075      0.0643 (0.0161)   5.842 (0.7865)    5.16 (2.086)     3 (0)    [illegible]
C-$\delta$ = 0.075    0.0384 (0.0106)   3.606 (0.6556)    3.82 (1.177)     3 (0)    [illegible]
$\delta$ = 0.1        0.0814 (0.0164)   7.792 (0.7434)    3.57 (0.9618)    3 (0)    [missing]
C-$\delta$ = 0.1      0.0575 (0.0121)   5.538 (0.6554)    3.13 (0.3912)    3 (0)    [missing]

Tab. 3. Results for the model with missing data, s = 3.




                      Err_1             Err_2             Nb_1             Nb_2     Exact
$\delta$ = 0          0.1470 (0.0536)   6.801 (1.686)     87.35 (3.683)    5 (0)    [missing]
C-$\delta$ = 0        0.3631 (0.0802)   6.114 (1.490)     104.23 (4.039)   5 (0)    [missing]
$\delta$ = 0.01       0.0961 (0.0340)   4.928 (1.180)     49.64 (5.527)    5 (0)    [missing]
C-$\delta$ = 0.01     0.0670 (0.0281)   3.627 (1.206)     46.69 (6.298)    5 (0)    [missing]
$\delta$ = 0.05       0.1375 (0.0391)   11.100 (1.557)    10.34 (3.347)    5 (0)    6
C-$\delta$ = 0.05     0.0864 (0.0307)   7.302 (1.475)     7.42 (2.404)     5 (0)    27
$\delta$ = 0.075      0.1769 (0.0427)   15.68 (1.548)     6.85 (1.867)     5 (0)    31
C-$\delta$ = 0.075    0.1311 (0.0427)   11.86 (1.737)     5.55 (1.013)     5 (0)    68
$\delta$ = 0.1        0.2286 (0.0455)   21.19 (1.385)     5.67 (1.049)     5 (0)    58
C-$\delta$ = 0.1      0.1933 (0.0595)   17.71 (2.056)     5.19 (0.6114)    5 (0)    88

Tab. 4. Results for the model with missing data, s = 5.
                      Err_1             Err_2             Nb_1             Nb_2            Exact
$\delta$ = 0          0.4479 (0.1407)   14.56 (3.060)     92.21 (2.881)    10 (0)          [missing]
C-$\delta$ = 0        1.208 (0.1705)    11.90 (2.197)     117.23 (6.532)   10 (0)          [missing]
$\delta$ = 0.01       0.3512 (0.1263)   13.59 (1.997)     52.76 (5.340)    10 (0)          [missing]
C-$\delta$ = 0.01     0.2921 (0.1317)   10.70 (2.049)     48.74 (6.067)    10 (0)          [missing]
$\delta$ = 0.05       0.7660 (0.2395)   47.13 (4.389)     20.29 (4.152)    9.96 (0.1959)   [missing]
C-$\delta$ = 0.05     0.6919 (0.2696)   41.55 (5.709)     16.99 (4.241)    9.94 (0.2374)   1
$\delta$ = 0.075      0.9683 (0.2721)   65.24 (5.496)     16.78 (3.545)    9.85 (0.4092)   [missing]
C-$\delta$ = 0.075    0.9443 (0.3067)   61.23 (7.066)     15.00 (3.452)    9.76 (0.5499)   5
$\delta$ = 0.1        1.150 (0.2807)    82.86 (6.745)     14.84 (2.948)    9.58 (0.6508)   1
C-$\delta$ = 0.1      1.157 (0.3049)    80.43 (8.359)     13.57 (2.804)    9.39 (0.7601)   11

Tab. 5. Results for the model with missing data, s = 10.

The results of the simulations are quite convincing. Indeed, the Compensated MU selector improves upon the MU selector with respect to all the considered criteria, in particular when $\theta^*$ is very sparse ($s = 1, 2, 3$). The order of magnitude of the improvement is such that, for the best $\delta$, the errors $Err_1$ and $Err_2$ are divided by 2. The improvement is less pronounced for larger $s$, especially for $s = 10$, when the model starts to be not very sparse. For all the values of $s$, the non-zero coefficients of $\theta^*$ are systematically in the sparsity pattern both of the MU selector and of the Compensated MU selector. The total number of non-zero coefficients is always smaller (i.e., closer to the correct one) for the Compensated MU selector. Finally, note that the best results for the error measures $Err_1$ and $Err_2$ are obtained with $\delta \le 0.075$, while the sparsity pattern is better retrieved for $\delta = 0.1$. This reflects a trade-off between estimation and selection.

7. Proofs 

Proof of Remark 1. It is enough to show that $A(\mu, \tau) = B(\mu, \tau)$ where

$$ B(\mu, \tau) = \{\theta \in \Theta : \exists\, u \in \mathbb{R}^p \text{ such that } (\theta, u) \in W(\mu, \tau)\}. $$

Let first $(\theta, u) \in W(\mu, \tau)$. Using the triangle inequality, we easily get that $\theta \in A(\mu, \tau)$. Now take $\theta \in A(\mu, \tau)$. We set

$$ N = \tfrac{1}{n} Z^T (y - Z\theta) + \hat D \theta $$

and consider $u \in \mathbb{R}^p$ defined by

$$ u_i = -N_i 1_{\{|N_i| \le \mu|\theta|_1\}} - \mathrm{sign}(N_i)\, \mu|\theta|_1 1_{\{|N_i| > \mu|\theta|_1\}}, $$

for $i = 1, \dots, p$, where $u_i$ and $N_i$ are the $i$th components of $u$ and $N$ respectively. It is easy to check that $(\theta, u) \in W(\mu, \tau)$, which concludes the proof.
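The vector $u$ built in this proof is the negative of $N$ clipped coordinate-wise to the interval $[-\mu|\theta|_1, \mu|\theta|_1]$. The sketch below constructs it and checks the two constraints of (6), namely $|N + u|_\infty \le \tau$ and $|u|_\infty \le \mu|\theta|_1$; the function names `build_u` and `in_W` and the numerical values are hypothetical.

```python
def build_u(N, mu, theta_l1):
    """u_i = -N_i if |N_i| <= mu*|theta|_1, else -sign(N_i)*mu*|theta|_1."""
    radius = mu * theta_l1
    # Clipping N_i to [-radius, radius] and flipping the sign gives both cases at once.
    return [-max(-radius, min(radius, v)) for v in N]

def in_W(N, u, mu, tau, theta_l1):
    """Check the two constraints of (6): |N + u|_inf <= tau and |u|_inf <= mu*|theta|_1."""
    return (max(abs(a + b) for a, b in zip(N, u)) <= tau
            and max(abs(b) for b in u) <= mu * theta_l1)

# Illustrative values: |N|_inf = 1.25 and mu*|theta|_1 = 0.5, so theta belongs to
# A(mu, tau) exactly when tau >= 0.75.
N = [0.25, -1.25, 1.0]
u = build_u(N, mu=0.25, theta_l1=2.0)
```

With $\tau = 0.75$ the pair $(\theta, u)$ satisfies both constraints, while with $\tau = 0.5$ it does not, in line with the equivalence of (4) and (5).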
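Proof of Theorem 1. The proof is based on two lemmas. For brevity, we will skip the dependence of $b(\varepsilon)$, $\delta_i(\varepsilon)$ and $\nu(\varepsilon)$ on $\varepsilon$.

Lemma 2. With probability at least $1 - 6\varepsilon$, we have $\theta^* \in A(\varepsilon)$.

Proof. We first write that $Z^T (y - Z\theta^*) + n \hat D \theta^*$ is equal to

$$ -X^T \Xi \theta^* + X^T \xi + \Xi^T \xi - (\Xi^T \Xi - \mathrm{Diag}\{\Xi^T \Xi\}) \theta^* - (\mathrm{Diag}\{\Xi^T \Xi\} - nD) \theta^* + n(\hat D - D) \theta^*. $$

By definition of the $\delta_i(\varepsilon)$ and $b(\varepsilon)$, with probability at least $1 - 6\varepsilon$ we have

$$ |\tfrac{1}{n} X^T \Xi \theta^*|_\infty \le |\tfrac{1}{n} X^T \Xi|_\infty |\theta^*|_1 \le \delta_1 |\theta^*|_1, \qquad (33) $$

$$ |\tfrac{1}{n} X^T \xi|_\infty + |\tfrac{1}{n} \Xi^T \xi|_\infty \le \delta_2 + \delta_3, \qquad (34) $$

$$ |\tfrac{1}{n} (\Xi^T \Xi - \mathrm{Diag}\{\Xi^T \Xi\}) \theta^*|_\infty \le |\tfrac{1}{n} (\Xi^T \Xi - \mathrm{Diag}\{\Xi^T \Xi\})|_\infty |\theta^*|_1 \le \delta_4 |\theta^*|_1, \qquad (35) $$

$$ |(\tfrac{1}{n} \mathrm{Diag}\{\Xi^T \Xi\} - D) \theta^*|_\infty \le |\tfrac{1}{n} \mathrm{Diag}\{\Xi^T \Xi\} - D|_\infty |\theta^*|_1 \le \delta_5 |\theta^*|_1, \qquad (36) $$

$$ |(\hat D - D) \theta^*|_\infty \le b |\theta^*|_1. \qquad (37) $$

Therefore $\theta^* \in A(\varepsilon)$ with probability at least $1 - 6\varepsilon$. □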
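Lemma 3. With probability at least $1 - 6\varepsilon$, for $\Delta = \hat\theta - \theta^*$ we have

$$ |\tfrac{1}{n} X^T X \Delta|_\infty \le \nu. $$

Proof. Throughout the proof, we assume that we are on the event of probability at least $1 - 6\varepsilon$ where inequalities (33)-(37) hold and $\theta^* \in A(\varepsilon)$. We have

$$ |\tfrac{1}{n} X^T X \Delta|_\infty \le |\tfrac{1}{n} Z^T (Z\hat\theta - \Xi\hat\theta - y + \xi)|_\infty + |\tfrac{1}{n} \Xi^T X \Delta|_\infty. $$

Consequently,

$$ |\tfrac{1}{n} X^T X \Delta|_\infty \le |\tfrac{1}{n} Z^T (Z\hat\theta - y) - \hat D \hat\theta|_\infty + |(\tfrac{1}{n} Z^T \Xi - D) \hat\theta|_\infty + |(\hat D - D) \hat\theta|_\infty + |\tfrac{1}{n} Z^T \xi|_\infty + |\tfrac{1}{n} \Xi^T X \Delta|_\infty. $$

Using that $\hat\theta \in A(\varepsilon)$, we easily get that $|\tfrac{1}{n} X^T X \Delta|_\infty$ is not greater than

$$ \mu |\hat\theta|_1 + 2\delta_2 + 2\delta_3 + b |\hat\theta|_1 + |(\tfrac{1}{n} Z^T \Xi - D) \hat\theta|_\infty + |\tfrac{1}{n} \Xi^T X \Delta|_\infty. $$

Now remark that

$$ |(\tfrac{1}{n} Z^T \Xi - D) \hat\theta|_\infty \le |\tfrac{1}{n} Z^T \Xi - D|_\infty |\hat\theta|_1 $$
$$ \le \big( |\tfrac{1}{n} (\Xi^T \Xi - \mathrm{Diag}\{\Xi^T \Xi\})|_\infty + |\tfrac{1}{n} \mathrm{Diag}\{\Xi^T \Xi\} - D|_\infty + |\tfrac{1}{n} X^T \Xi|_\infty \big) |\hat\theta|_1 $$
$$ \le (\delta_1 + \delta_4 + \delta_5) |\hat\theta|_1. $$

Finally, using that

$$ |\tfrac{1}{n} \Xi^T X \Delta|_\infty \le |\hat\theta - \theta^*|_1 |\tfrac{1}{n} X^T \Xi|_\infty \le \delta_1 (|\hat\theta|_1 + |\theta^*|_1), $$

together with the fact that $|\hat\theta|_1 \le |\theta^*|_1$, we obtain the result. □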

We now proceed to the proof of Theorem 1. The bounds (18) and (19) follow from Lemma 3, the fact that $|\Delta_{J^c}|_1 \le |\Delta_J|_1$ where $J$ is the set of non-zero components of $\theta^*$ (cf. Lemma 1 in Rosenbaum and Tsybakov (2010)) and the definition of the sensitivities $\kappa_q(s)$, $\kappa_k^*(s)$. To prove (20), first note that

$$ \frac{1}{n} |X\Delta|_2^2 \le \frac{1}{n} |X^T X \Delta|_\infty |\Delta|_1, \qquad (38) $$

and use (18) with $q = 1$ and Lemma 3. This yields the first term under the minimum on the right-hand side of (20). The second term is obtained again from (38), Lemma 3 and the inequality $|\Delta|_1 \le |\hat\theta|_1 + |\theta^*|_1 \le 2|\theta^*|_1$.

Proof of Theorem 2. The bounds (21) and (24) follow by combining (18) with (12) and with (10)-(11) respectively. Next, (22) follows from (20) and (12). Also, as an easy consequence of (18) and (13) with $q = 2$ we get

$$ |\hat\theta - \theta^*|_2 \le \frac{4 s^{1/2}\, \nu(\varepsilon)}{\kappa_{RE}(2s)}. $$

Finally, (23) follows from this inequality and (21) using the interpolation formula $|\Delta|_q^q \le |\Delta|_1^{2-q} |\Delta|_2^{2(q-1)}$ for $\Delta = \hat\theta - \theta^*$, and the fact that $\kappa_{RE}(s) \ge \kappa_{RE}(2s)$.
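As a worked check of the constants in this last step, the following derivation assumes that (21) reads $|\Delta|_1 \le 4s\,\nu/\kappa_{RE}(s)$ and that $C(q) = 2^{-1/q-1/2}(1 + (q-1)^{-1/q})^{-1}$ in (13); under these assumptions it recovers a bound of the form $4 s^{1/q} \nu/\kappa_{RE}(2s)$ for $1 < q \le 2$.

```latex
C(2) = 2^{-1/2-1/2}\,\bigl(1 + 1^{-1/2}\bigr)^{-1} = \tfrac14
\quad\Longrightarrow\quad
|\Delta|_2 \le \frac{\nu}{C(2)\,s^{-1/2}\,\kappa_{RE}(2s)}
          = \frac{4 s^{1/2}\,\nu}{\kappa_{RE}(2s)},
\\[6pt]
|\Delta|_q^q \le |\Delta|_1^{2-q}\,|\Delta|_2^{2(q-1)}
\le \Bigl(\frac{4 s\,\nu}{\kappa_{RE}(2s)}\Bigr)^{2-q}
    \Bigl(\frac{4 s^{1/2}\,\nu}{\kappa_{RE}(2s)}\Bigr)^{2(q-1)}
= s\,\Bigl(\frac{4\nu}{\kappa_{RE}(2s)}\Bigr)^{q},
\\[6pt]
|\Delta|_q \le \frac{4 s^{1/q}\,\nu}{\kappa_{RE}(2s)}, \qquad 1 < q \le 2.
```

The exponents of $s$ add up as $(2 - q) + (q - 1) = 1$, which is where the factor $s^{1/q}$ comes from.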